class: inverse center middle title-slide .top[ <img src="data:image/png;base64,#media/logos2.svg" width="100%" /> ] ### Crowdsourced acoustic open data analysis with FOSS4G tools ##### <span style='color:#ef7d00;'>Nicolas Roelandt</span>, P. Aumond, L. Moisan .bottom[ ##### <span style='color:#ef7d00;'>_FOSS4G 2022_</span>] <!-- ###### [gflowiz.github.io/presentations/FOSS4G2021.html](https://gflowiz.github.io/presentations/FOSS4G2021.html) --> ###### Press <span style='color:#ef7d00;'>P</span> to access notes --- ## Introduction Traffic noise is a major health concern : -- - 1 million healthy life years (DALYs) lost each year in Western Europe due to traffic noise [WHO 2011](https://intranet.euro.who.int/__data/assets/pdf_file/0008/136466/e94888.pdf) -- - social cost of noise in France estimated at 147 billion euros per year [ADEME 2021](https://librairie.ademe.fr/air-et-bruit/4815-cout-social-du-bruit-en-france.html) --- ## How to find problematic areas ? -- - Direct measure on the whole area is not possible -- - Traditional way is simulation from traffic counts (air, rail, road) and infrastructure <img src="data:image/png;base64,#media/carte-bruit-bouguenais.png" style="width: 50%" /> --- ### UMRAE proposal : Capture sound environment with a smartphone app. .pull-left[<img src="data:image/png;base64,#https://noise-planet.org/assets/img/noisecapture/1.2.7/NoiseCapture_Measurement_spectrogram.jpg" style="width: 50%" />] .pull-right[<img src="data:image/png;base64,#https://noise-planet.org/assets/img/noisecapture/1.2.7/NoiseCapture_Results.jpg" style="width: 50%" />] .center[[NoiseCapture is available on Google play store](https://play.google.com/store/apps/details?id=org.noise_planet.noisecapture)] ??? simulation: not real data crowdsourced data: lots of real data, large area coverage, disparate and poor quality material --- layout: false class: inverse center middle ## What can we do with the data collected by the app ? ??? The question is straight forward. In this presentation, I'll speak about the dataset, the trails we are exploring, the results we got so far and some difficulties we had and how to mitigate those. --- ## NoiseCapture dataset -- - 3 years data collection (2017-2020, still collecting) -- - 260 000 tracks worldwide -- - sound spectrum, tags and gps localization -- - ODC Open Database License v1.0 [data.univ-gustave-eiffel.fr/dataset.xhtml?persistentId=doi:10.25578/J5DG3W](https://data.univ-gustave-eiffel.fr/dataset.xhtml?persistentId=doi:10.25578/J5DG3W) ??? sound spectrum: third of octave, each 1s --- ## How to characterize of the user environment with the collected data ? -- 2 possibilities : -- - from the sound spectrum -- - from the *tags* defined by the user ??? notes - from the sound spectrum: probably the hardest. Needs pattern recognition and maybe machine learning. - From the tags: not all tracks have tags. Information can be easily analyse with statistical software. We choose the second approach in a first time because it is easiest to do but the sound spectrum will be analysed in a second stage. If the tags can help to describe the sound spectrum, it will be a real plus. --- ## Database and subset .pull-left[ - 260 422 tracks - 124 363 with tags - 50280 not indoor or tests - 47 412 duration > 5 s - 11 492 in France ] -- .pull-right[<img src="data:image/png;base64,#https://universite-gustave-eiffel.github.io/lasso-data-analysis/articles/plots/tags_repartition.png" style="width: 150%" />] --- ## Toolkit A quite simple one: -- - PostgreSQL/PostGIS -- - R -- - and lots of R packages : Tidyverse, sf, geojsonsf, stats, suncalc... ??? The toolkit is relatively simple, PostgreSQL/Postgis because the data source is a PostgreSQL dump. So in order to access it, you have to set up a database to store the data. R is my work horse for many years, it is use, among other tools, by the UMRAE team. The complexity is underneath. The large amont of R package make reproductibilty complex. --- layout: true ## Preliminary results --- ### Well known temporal sound source dynamics -- .pull-left[ <img src="data:image/png;base64,#media/animals-tags.png" style="width: 100%" /> ] -- .pull-right[ <img src="data:image/png;base64,#media/roads-tags.png" style="width: 100%" /> ] ??? The graph on the left shows the proportion of the "animals" tags around sunrise time (the center is the sunrise time of the day it was recorded). We can see a peak during the 1 hour before the sunrise. It is a well known temporal dynamic for bird songs. On the right, it is the proportion of "roads" tags (often associated with traffic noise) in local time. We can see two peaks around 8 to 9 AM and 18 to 19 PM. It is very similar to what is observed with commute times. --- ### Physical events -- .pull-left[ <img src="data:image/png;base64,#media/wind-tags.png" style="width: 90%" /> r(7) = .93 (p < 0.01) between `wind` tag proportion and the measured wind force ] -- .pull-right[ <img src="data:image/png;base64,#media/rain-tags.png" style="width: 90%" /> r(6) = 0.68 (p < 0.1) between the `rain` tag proportion and the measured rain fall ] ??? In a second time, we looked if the presence of certain tags related to physical events like the rain or the wind where coherent with the measure recorded by the national weather service. On the left, we can see the proportion of the tag "wind" regarding the wind force (Beaufort scale). The correlation is very good : O.93. On the right, this the proportion of rain tag regarding the measured rainfall. The correlation is not as good : 0.68. --- layout: true ## Reproductible Science is an issue --- ### Good - [Data available](https://data.univ-gustave-eiffel.fr/dataset.xhtml?persistentId=doi:10.25578/J5DG3W) - [Source code available](https://github.com/Universite-Gustave-Eiffel/lasso-data-analysis) (SQL scripts and R notebooks) - Setup available -- ### Bad - Some notebooks needs work on reproductibility (and code factoring) - Environment info is too scarce (and hard to reuse) --- ## Some avenues of investigation - R package [Renv](https://rstudio.github.io/renv/articles/renv.html) - [Docker](https://www.docker.com/) - [Guix](https://guix.gnu.org/) ??? - Renv: R package that can store and recreate an R environment (R and packages versions), limited to R - Docker : small virtual machinesthat can be re-build and/or re-run. There is R and Postgis images ready to use - Guix: to my knowledge the best way to reproduce a software environment --- layout: false class: inverse center middle # Conclusion --- layout: true # Conclusion --- - Crowdsourced data can be useful for science - Known phenomena can be found in the dataset - There is still a lot to study - FOSS are a not only useful for the industry and science but <span style='color:#ef7d00;'>key for Reproductible Science</span> - Reproductible Science is hard to achieve and has to take in account as soon as possible --- layout: false class: center inverse .center[ <img src="data:image/png;base64,#media/logos2.svg" width="100%" /> ] # Thanks! Nicolas Roelandt - Univ. Gustave Eiffel [nicolas.roelandt@univ-eiffel.fr](mailto:nicolas.roelandt@univ-eiffel.fr) [@RoelandtN42](https://twitter.com/RoelandtN42) Access to code source : [github.com/Universite-Gustave-Eiffel/lasso-data-analysis](https://github.com/Universite-Gustave-Eiffel/lasso-data-analysis) Access to dataset : [noise-planet.org](http://data.noise-planet.org/index.html) Detailed articles and notebooks : [universite-gustave-eiffel.github.io/lasso-data-analysis/articles/](https://universite-gustave-eiffel.github.io/lasso-data-analysis/articles/) .bottom-left[ ###### Slides created via the R packages [xaringan](https://github.com/yihui/xaringan) and [gadenbuie/xaringanthemer](https://github.com/gadenbuie/xaringanthemer) ]